-
Notifications
You must be signed in to change notification settings - Fork 2
Create versioned releases of pandoc-wasm #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
.github/workflows/build.yml
Outdated
repository: haskell-wasm/pandoc | ||
ref: wasm | ||
repository: jgm/pandoc | ||
ref: main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to put a version number on these builds, shouldn't they also be pinned to a specific commit (probably a release tag) for the underlying Pandoc version too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alerque Yes, you are right. I am trying to figure out how to best do this. It should probably be easy to get it to build with the most recent pandoc version and we probably need another digit to the number to show the version of the pandoc-wasm package. So 3.6.3.x
instead of just 3.6.3
. If you have a proposal of how to do the versioning in the most standard complying and simple way - I'm all for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just adding a segment will make it hard to parse because Pandoc follows PVP and has a variable number of segments. I think you're going to need a different segment operator that plays nice with distro versioning (I think +
is the most robust option). The question is probably does it make more sense to version this project first, then append the relevant Pandoc version, or the other way around? e.g. pandoc-wasm-3.6.3+0.1
or pandoc-wasm-0.1+3.6.3
. I think the latter probably makes more sense but it depends on the expected release channels and use workflows I guess. I don't really have a handle on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alerque OK, as someone working mainly with JS/TS in browsers, I would expect it to become available in npm. But I'm open to others needing it other places - maybe? I like both of your versioning proposals, and unless there is reason not to do so, I'd go for the second one then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alerque This seems to now be working with tagged versions, etc. . Only thing remaining is put it it on npm. As I assume that @TerrorJack or @tweaf or @alerque will want to have control over the npm repository after merging this PR (or writing something similar), I will not add that part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alerque It looks like npm does not allow this versioning scheme. It only allows 3 digit semver. So I split the two numbering apart. It's still really easy to update the pandoc version though.
@georgestagg It looks like your archived |
Yes, you have my permission. Which npm username should I invite as a maintainer for the package? |
@georgestagg I'll only do it if none of those who really developed this package will do it. My username is johanneswilm. But it would be better if @alerque @TerrorJack or someone else here would do it.I know very little about haskell and wasm. |
patch/pandoc.patch
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you used haskell-wasm/pandoc#1 to create this, makes sense 👍
Now that I see the minimized diff again, I realize that the only remaining thing that is actually patched in the Haskell code here is the addition of the wasm_main
export, such that the same .wasm
file can be used both as a command module (e.g. locally via wasmtime
) and as a reactor (for the web app). While this is quite neat, having separate .wasm
files for these would have the following advantages:
- Apart from the removal of
-threaded
(which could be upstreamed behind aarch(wasm32)
conditional), no pandoc patching would be necessary, one could just build stockpandoc-cli
(of course with an appropriatecabal.project
for various dependencies). - In the separate build that exposes an FFI, one could actually use the full Wasm JSFFI which is more convenient.
@johanneswilm In your use case, do you use the .wasm
file as a command or a reactor module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amesgen That sounds promising. Yes, I took your patch in order to make the diff smaller.
My use case is this: I have created an open source word processor similar to Google Docs/Microsoft Word 365 Online, etc. for a specific niche market. I have written a number of export filters myself to common formats (like DOCX, ODT, HTML, EPUB, JATS, etc.). These are written in JS and all run in the end users browser. I have even written an import filter for ODT files that works the same way.
Now I have a number of users wanting to import and export from other, more exotic formats. So I've created import/export filters in JS to the pandoc internal json format. And I then let the client send the pandoc json to a server where pandoc is run in server mode, converting the json to one of the other formats and sending it back to the client.
This is a bit problematic as these conversions take up processing power on the server and it's quite complex to deploy pandoc to a number of different architectures for which there are no pre-compiled binaries. There could even be security issues about sending various files back and forth.
So my idea is to instead use this, if the user clicks on export to or import from an "exotic" format for the first time, the browser will download the pandoc.wasm file (to cache for future use) and then to do the conversion in the users own browser instead - taking the users own processing power and not that of the server.
I assume you are referring to the terms "reactor" and "command" as they are defined here [1]. Given that I only want to execute the conversion based on one input and then to close down again (only caching the binary so it doesn't have to be downloaded again), I assume this corresponds to pandoc-cli more than pandoc-server.
I haven't yet tried whether it is actually possible. Based on the web demo it seems like it should though.
I don't know the who-is-who of the haskell-wasm world either. So I don't know which one of all of you can make any decisions here and who would be a good candidate to maintain an npm package of pandoc-wasm. I do maintain several open source packages, but those are written in languages that I use daily (like JS/TS or Python). So if one of you would want to step forward and do this, I'd be in favor.
This reverts commit b4f0480.
@TerrorJack I had to pin ghc-wasm-meta to a specific git commit as your recent changes there seem to have broken the pandoc wasm build. By pinning ghc-wasm-meta, I was able to build it with pandoc 3.6.4 (latest). |
wasi_snapshot_preview1: wasi.wasiImport | ||
}) | ||
|
||
wasi.initialize(instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a good idea to initialize a new instance for every call to pandoc? I am not 100% sure it is related, but I keep running into out of memory errors with this approach.
My use case is an HTML WYSIWYG editor that converts its HTML to Typst on every change, so the pandoc method is called a lot.
The previous script reused one instance for all conversions and I did not run into any out of memory errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. So that means that previous instances are not garbage collected? Do you happen to keep references to the old instances anywhere?
Why the change - it has been a while since I wrote this part. Let me try to reconstruct why it was done this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NiklasEi I think it may have to do with the fact that the conversion process can produce several more output files - for example image files with differing names. So if you reuse the instance, the media folder will have both the images from previous and the current run. I don't know this wasi shim enough to say whether it's possible to "clean" the folders after each run - that may be possible, but I didn't find much documentation at the time.
How exactly does the out-of-memory issue manifest? Do you get a popup from the browser saying it is out of memory? If you keep the instance around, you'd have to make sure everything is synchronous (no worker and conversion 2 has to wait until conversion 1 is fully done)+ the cleanup. And you'd at least have to keep one wasi instance around at all times in memory - which I understand is ok in your case.
I use it for something fairly similar, I only convert upon user request. And if the garbage collection is working OK, then the instance should disappear again from memory after the conversion. This should be even more important if you do lots of conversions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @NiklasEi
I am not reading or writing any files other than input/output.
I understand, but some of the transformations will create these extra files in the internal file system.
As for my other question: are you keeping references to anything that is related to running this in your JavaScript code? I am asking because keeping a reference will do that the garbage collector will not be able to release the memory consumed by a previous instance, AFAIU.
I cannot share the stacktrace but there doesn't seem to be anything interesting in it anyways.
Ok, so you are working on a closed source competitor to my open source software and want my help in debugging it without being able to share many details? :)
Seriously though, I understand that you may have a reason to do things closed source. But that also makes it difficult to help for people outside of your company.
As I understand it, you are not even entirely sure why you have a problem and whether it is related to this at all. I have incorporated a ton of changes in this release, this being one of many. It doesn't really make sense for me to spend another day or three on reverting this change and then solving all the resulting problems in some other way - only for you to then to discover that your issue is something entirely different.
My recommendation would be for you to try to create a minimal example that you can publish openly that exhibits the problem and then I can also help you with resolving the issue you are facing if it turns out to be a bug in this library.
Please file future issues with https://github.com/fiduswriter/wasm-pandoc/tree/main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I did some memory testing with Fidus Writer using the Chromium development console memory tab. Memory usage goes up by about 130MB during conversion with pandoc (236MB to 366MB). When the conversion is done, the entire 130MB are released and it goes back down to 236MB. I've tried exporting a number of times - the memory usage always goes back down to the same amount. But I made sure not to keep any reference to a previous instance of a previous run anywhere in my code so that the garbage collector can release the memory consumed by pandoc.
Hey @TerrorJack @tweag @amesgen
This is the kind of thing I was thinking of in #9 . The changes are basically:
Include a patch to the pandoc sources and link to the original pandoc repository. No need to maintain a fork of the pandoc sources. Only the patch may need to be adjusted every now and then.
Publish versioned release files (for example by using
{pandoc-version}.{pandoc-wasm-version}
. Currently that would be3.6.3.0
.Ideally, these releases would also be pushed to something like npm or some other repository for wasm files.
I understand a lot is going on in the wasm space and you expect changes on how this is done in the not too distant future. But meanwhile, these releases already serve a purpose so it would make sense to distribute them.